In this lab we will practice some R by exploring data from the Game of Thrones series with the dplyr package.
The data were collected by Jeffrey Lancaster and come from this project.
Import the data with readr::read_csv().
Use str and summary on the different tables to understand their structures and how they relate to each other.
Use nrow, ncol, or dim to get the dimensions of the data frames. Use the names function to get the column names, together with the %in% operator.
[1] 12114 2
[1] "sceneId"
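The dimension and column-name checks can be sketched on two tiny stand-in tables. The values below are made up, and the reduced column sets are an assumption; only the column names come from this lab.

```r
# Toy stand-ins for two of the tables (values are hypothetical).
appearances_toy <- data.frame(
  sceneId = c(1, 1, 2),
  name = c("Jon Snow", "Arya Stark", "Jon Snow"),
  stringsAsFactors = FALSE
)
scenes_toy <- data.frame(
  sceneId = c(1, 2),
  duration = c(120, 45),
  location = c("The North", "The Wall"),
  stringsAsFactors = FALSE
)

dim(appearances_toy)  # number of rows, then number of columns
nrow(scenes_toy)
ncol(scenes_toy)

# Column names present in both tables, i.e. candidate join keys.
common <- names(appearances_toy)[names(appearances_toy) %in% names(scenes_toy)]
common
```

Checking shared column names this way is a quick way to see which key a later join will rely on.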
name sex house killedBy
Length:587 Length:587 Length:587 Length:587
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
image
Length:587
Class :character
Mode :character
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 587 obs. of 5 variables:
$ name : chr "Addam Marbrand" "Adrack Humble" "Aeron Greyjoy" "Aerys Targaryen" ...
$ sex : chr "male" "male" "male" "male" ...
$ house : chr NA NA "Greyjoy" NA ...
$ killedBy: chr NA NA NA NA ...
$ image : chr NA NA "https://images-na.ssl-images-amazon.com/images/M/MV5BNzI5MDg0ZDAtN2Y2ZC00MzU1LTgyYjQtNTBjYjEzODczZDVhXkEyXkFqcG"| __truncated__ NA ...
- attr(*, "spec")=
.. cols(
.. name = col_character(),
.. sex = col_character(),
.. house = col_character(),
.. killedBy = col_character(),
.. image = col_character()
.. )
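The import and inspection steps can be sketched on a tiny file written on the fly. The file content, the path csv_path and the name characters_toy are made up for illustration; the real file names of the lab are not given here.

```r
library(readr)

# Write a small hypothetical CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,sex,house,killedBy",
  "Aeron Greyjoy,male,Greyjoy,",
  "Addam Marbrand,male,,"
), csv_path)

# Force every column to character; empty fields are read as NA by default.
characters_toy <- read_csv(csv_path, col_types = cols(.default = col_character()))

str(characters_toy)      # compact view: one line per column, with its type
summary(characters_toy)  # per-column summary (length, class and mode for character columns)
```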
Use sum.
[1] 376
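A minimal sketch of a total computed with sum, assuming the total is taken over a numeric column such as nbdeath. The table scenes_toy and its values are hypothetical.

```r
# Hypothetical scenes table with a death count per scene.
scenes_toy <- data.frame(sceneId = 1:4, nbdeath = c(0, 2, 0, 1))

# Total over the whole column.
total_deaths <- sum(scenes_toy$nbdeath)
total_deaths

# sum also counts rows matching a condition, since TRUE is treated as 1.
nb_scenes_with_death <- sum(scenes_toy$nbdeath > 0)
nb_scenes_with_death
```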
Use table and sort.
Jon Snow Daenerys Targaryen Arya Stark Sandor Clegane
12 11 10 9
Cersei Lannister
7
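A ranking like the one above can be sketched with table and sort on a character vector. The vector killed_by_toy is made up; applying the same idea to the killedBy column is an assumption about how the output was produced.

```r
# Hypothetical vector of killers, with one missing value.
killed_by_toy <- c("Jon Snow", "Arya Stark", "Jon Snow", NA,
                   "Jon Snow", "Arya Stark", "Sandor Clegane")

# table counts each distinct value (NA is dropped by default),
# sort with decreasing = TRUE puts the largest counts first.
counts <- sort(table(killed_by_toy), decreasing = TRUE)

head(counts, 2)  # the two most frequent values
```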
Use which.max.
# A tibble: 1 x 9
sceneStart sceneEnd location subLocation episodeId duration nbc sceneId nbdeath
<time> <time> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 45'58" 56'59" The Crownl… Outside King's… 73 661 18 3795 0
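A minimal base R sketch of selecting the row with the largest value using which.max. The table scenes_toy and its values are hypothetical.

```r
# Hypothetical scenes table.
scenes_toy <- data.frame(
  sceneId = c(10, 11, 12),
  episodeId = c(1, 1, 2),
  duration = c(30, 661, 120)
)

# which.max returns the position of the first maximum.
idx <- which.max(scenes_toy$duration)

# Use that position to extract the whole row.
longest <- scenes_toy[idx, ]
longest
```

Note that which.max returns only the first maximum, so ties are not reported.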
Use arrange and head.
# A tibble: 1 x 9
sceneStart sceneEnd location subLocation episodeId duration nbc sceneId nbdeath
<time> <time> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 45'58" 56'59" The Crownl… Outside King's… 73 661 18 3795 0
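The same row selection can be sketched with dplyr by sorting and keeping the first row. The table scenes_toy and its values are hypothetical.

```r
library(dplyr)

# Hypothetical scenes table.
scenes_toy <- tibble(
  sceneId = c(10, 11, 12),
  duration = c(30, 661, 120)
)

# Sort by decreasing duration, then keep the first row.
longest <- scenes_toy %>%
  arrange(desc(duration)) %>%
  head(1)
longest
```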
Use left_join to perform a join.
# A tibble: 18 x 10
sceneStart sceneEnd location subLocation episodeId duration nbc sceneId nbdeath name
<time> <time> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Tyri…
2 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Grey…
3 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Samw…
4 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Edmu…
5 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Arya…
6 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Bran…
7 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Sans…
8 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Brie…
9 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Davo…
10 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Gend…
11 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Yara…
12 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Robi…
13 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Yohn…
14 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Dorn…
15 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Lord…
16 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Lord…
17 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Lord…
18 45'58" 56'59" The Cro… Outside Ki… 73 661 18 3795 0 Lord…
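A sketch of how one scene row can be expanded into one row per character with left_join, assuming the two tables share the sceneId key. Both toy tables and their values are hypothetical.

```r
library(dplyr)

# Hypothetical tables sharing the sceneId key.
scenes_toy <- tibble(sceneId = c(1, 2), duration = c(30, 661))
appearances_toy <- tibble(
  sceneId = c(1, 2, 2, 2),
  name = c("Jon Snow", "Tyrion Lannister", "Arya Stark", "Sansa Stark")
)

# Keep the longest scene, then attach every character appearing in it.
# The single scene row is repeated once per matching appearance.
cast <- scenes_toy %>%
  arrange(desc(duration)) %>%
  head(1) %>%
  left_join(appearances_toy, by = "sceneId")
cast
```

The scene columns are duplicated on every row, which matches the shape of the output above.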
Use group_by and summarize to perform an aggregation.
# A tibble: 26 x 2
location nbsc
<chr> <int>
1 The Crownlands 1252
2 The North 888
3 North of the Wall 342
4 The Wall 309
5 The Riverlands 301
6 Meereen 168
7 Braavos 103
8 The Vale 54
9 Dorne 49
10 The Reach 40
# … with 16 more rows
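A count per group like the one above can be sketched with group_by, summarise and n(). The table scenes_toy is hypothetical; the column name nbsc follows the output above.

```r
library(dplyr)

# Hypothetical scenes table.
scenes_toy <- tibble(
  sceneId = 1:5,
  location = c("The North", "The Crownlands", "The North", "The Wall", "The North")
)

# One row per location, with the number of scenes in each.
per_location <- scenes_toy %>%
  group_by(location) %>%
  summarise(nbsc = n()) %>%
  arrange(desc(nbsc))
per_location
```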
Use filter.
# A tibble: 0 x 2
# … with 2 variables: location <chr>, nbsc <int>
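An empty tibble like the one above is what filter returns when no row satisfies the condition. One possible cause, which is only an assumption here, is testing location against a value that lives in subLocation. The sketch below reproduces that situation on a hypothetical table.

```r
library(dplyr)

# Hypothetical scenes table with a coarse and a precise location.
scenes_toy <- tibble(
  sceneId = 1:4,
  location = c("The Crownlands", "The Crownlands", "The North", "The Crownlands"),
  subLocation = c("King's Landing", "King's Landing", "Winterfell", "Dragonstone")
)

# Filtering the per-location counts on a subLocation value matches nothing.
no_match <- scenes_toy %>%
  group_by(location) %>%
  summarise(nbsc = n()) %>%
  filter(location == "King's Landing")
no_match

# Grouping on subLocation instead gives a non-empty answer.
per_sub <- scenes_toy %>%
  group_by(subLocation) %>%
  summarise(nbsc = n()) %>%
  filter(subLocation == "King's Landing")
per_sub
```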
Use sum during the aggregation, together with the subLocation variable.
# A tibble: 97 x 2
subLocation nbd
<chr> <dbl>
1 King's Landing 74
2 Winterfell 60
3 <NA> 51
4 Castle Black 41
5 The Twins 12
6 The Haunted Forest 9
7 The Wall 9
8 Craster's Keep 8
9 The Wolfswood 8
10 Dragonstone 6
# … with 87 more rows
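A sum per group can be sketched the same way as a count, replacing n() with sum over a numeric column. The table scenes_toy is hypothetical; the column name nbd follows the output above.

```r
library(dplyr)

# Hypothetical scenes table; one scene has no precise location.
scenes_toy <- tibble(
  subLocation = c("Winterfell", "King's Landing", NA, "Winterfell"),
  nbdeath = c(2, 5, 1, 1)
)

# Total number of deaths per subLocation, largest first.
# Rows with a missing subLocation form their own NA group.
deaths <- scenes_toy %>%
  group_by(subLocation) %>%
  summarise(nbd = sum(nbdeath)) %>%
  arrange(desc(nbd))
deaths
```

The NA group explains why a row labelled <NA> can appear in such a ranking.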
Use sum during the aggregation, and perform joins so that you can aggregate at the episode level.
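A sketch of an episode-level aggregation, assuming episodeId and duration live in the scenes table and are reached from appearances through sceneId. Both toy tables and their values are hypothetical.

```r
library(dplyr)

# Hypothetical tables sharing the sceneId key.
appearances_toy <- tibble(
  sceneId = c(1, 2, 3, 3),
  name = c("Jon Snow", "Jon Snow", "Jon Snow", "Arya Stark")
)
scenes_toy <- tibble(
  sceneId = c(1, 2, 3),
  episodeId = c(1, 1, 2),
  duration = c(100, 50, 120)
)

# Bring episodeId and duration onto each appearance, then sum per
# character and episode. .groups = "drop" returns an ungrouped tibble.
time_per_episode <- appearances_toy %>%
  left_join(scenes_toy, by = "sceneId") %>%
  group_by(name, episodeId) %>%
  summarise(time = sum(duration), .groups = "drop") %>%
  arrange(desc(time))
time_per_episode
```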
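# Self-join on sceneId: one row per ordered pair of characters sharing a scene
appearances %>%
  left_join(appearances, by = "sceneId") %>%
  filter(name.x != name.y) %>%    # drop pairs of a character with itself
  group_by(name.x, name.y) %>%
  summarise(nbs = n()) %>%        # number of shared scenes per pair
  arrange(desc(nbs))
# A tibble: 8,472 x 3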
# Groups: name.x [576]
name.x name.y nbs
<chr> <chr> <int>
1 Daenerys Targaryen Drogon 172
2 Drogon Daenerys Targaryen 172
3 Daenerys Targaryen Jorah Mormont 148
4 Jorah Mormont Daenerys Targaryen 148
5 Jon Snow Tormund Giantsbane 145
6 Tormund Giantsbane Jon Snow 145
7 Lord Varys Tyrion Lannister 141
8 Tyrion Lannister Lord Varys 141
9 Davos Seaworth Jon Snow 136
10 Jon Snow Davos Seaworth 136
# … with 8,462 more rows
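# Same self-join, now weighting each shared scene by its duration
appearances %>%
  left_join(appearances, by = "sceneId") %>%
  filter(name.x != name.y) %>%    # drop pairs of a character with itself
  left_join(scenes %>% select(sceneId, duration), by = "sceneId") %>%
  group_by(name.x, name.y) %>%
  summarise(commonTime = sum(duration)) %>%   # total shared screen time per pair
  arrange(desc(commonTime))
# A tibble: 8,472 x 3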
# Groups: name.x [576]
name.x name.y commonTime
<chr> <chr> <dbl>
1 Daenerys Targaryen Jorah Mormont 12923
2 Jorah Mormont Daenerys Targaryen 12923
3 Lord Varys Tyrion Lannister 10764
4 Tyrion Lannister Lord Varys 10764
5 Davos Seaworth Jon Snow 10380
6 Jon Snow Davos Seaworth 10380
7 Daenerys Targaryen Missandei 9924
8 Missandei Daenerys Targaryen 9924
9 Jon Snow Tormund Giantsbane 9352
10 Tormund Giantsbane Jon Snow 9352
# … with 8,462 more rows
Use geom_bar, but specify that no aggregation should be performed, with the option stat='identity'.
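library(ggplot2)
# Total screen time of Jon Snow per episode, in the unit of scenes$duration
jstime <- appearances %>%
  filter(name == "Jon Snow") %>%
  left_join(scenes, by = "sceneId") %>%
  group_by(episodeId) %>%
  summarise(time = sum(duration))
# stat = 'identity' uses the precomputed totals as bar heights
ggplot(jstime) +
  geom_bar(aes(x = episodeId, y = time), stat = 'identity') +
  theme_bw() +
  xlab("episode") + ylab("time") +
  ggtitle("Screen time per episode for Jon Snow")
Reproducibility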
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8
[4] LC_COLLATE=fr_FR.UTF-8 LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.3.0 dplyr_1.0.0 readr_1.3.1 knitr_1.29
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 RColorBrewer_1.1-2 pillar_1.4.3 compiler_3.6.3
[5] unilur_0.4.0.9000 tools_3.6.3 digest_0.6.25 evaluate_0.14
[9] tibble_2.1.3 lifecycle_0.2.0 gtable_0.3.0 pkgconfig_2.0.3
[13] rlang_0.4.7 cli_2.0.2 yaml_2.2.1 xfun_0.17
[17] withr_2.2.0 stringr_1.4.0 generics_0.0.2 vctrs_0.3.2
[21] hms_0.5.3 grid_3.6.3 tidyselect_1.1.0 glue_1.4.2
[25] R6_2.4.1 fansi_0.4.1 rmarkdown_2.3 farver_2.0.3
[29] purrr_0.3.4 magrittr_1.5 scales_1.1.0 htmltools_0.5.0
[33] ellipsis_0.3.0 assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
[37] utf8_1.1.4 stringi_1.5.3 munsell_0.5.0 crayon_1.3.4